{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8a5706df",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/evaluation/retrieval/retriever_eval.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36129c05-81f2-466c-a507-b62a577199d8",
   "metadata": {},
   "source": [
    "# Retrieval Evaluation\n",
    "\n",
    "This notebook uses our `RetrieverEvaluator` to evaluate the quality of any Retriever module defined in LlamaIndex.\n",
    "\n",
    "We specify a set of different evaluation metrics: this includes hit-rate, MRR, Precision, Recall, AP, and NDCG. For any given question, these will compare the quality of retrieved results from the ground-truth context.\n",
    "\n",
    "To ease the burden of creating the eval dataset in the first place, we can rely on synthetic data generation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8659681a-7141-4f80-9bbe-8eddc061a134",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Here we load in data (PG essay), parse into Nodes. We then index this data using our simple vector index and get a retriever."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9df86266",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-llms-openai\n",
    "%pip install llama-index-readers-file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "285cfab2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f63b16c-6a83-4ef0-a451-43c2c3d9c828",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "from llama_index.llms.openai import OpenAI"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4e3b3f28",
   "metadata": {},
   "source": [
    "Download Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "589c112d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p 'data/paul_graham/'\n",
    "!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab50ac91-e9d4-4fae-a519-db5711a13210",
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66499039-d76d-4914-b03a-bcbd10c8c33b",
   "metadata": {},
   "outputs": [],
   "source": [
    "node_parser = SentenceSplitter(chunk_size=512)\n",
    "nodes = node_parser.get_nodes_from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af31b424-02c4-4731-beca-e88ef4f202ca",
   "metadata": {},
   "outputs": [],
   "source": [
    "# by default, the node ids are set to random uuids. To ensure same id's per run, we manually set them.\n",
    "for idx, node in enumerate(nodes):\n",
    "    node.id_ = f\"node_{idx}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1268ad2-3b29-49dc-92b3-6894900b4534",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = OpenAI(model=\"gpt-4\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11836d3f-136d-45fe-bad8-f0480751ee67",
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_index = VectorStoreIndex(nodes)\n",
    "retriever = vector_index.as_retriever(similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "baf75ec6-7419-4975-83df-6d6fa08adb77",
   "metadata": {},
   "source": [
    "### Try out Retrieval\n",
    "\n",
    "We'll try out retrieval over a simple dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd260990-0aea-490b-99e0-d7517f668020",
   "metadata": {},
   "outputs": [],
   "source": [
    "retrieved_nodes = retriever.retrieve(\"What did the author do growing up?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce4b0823-8be4-4dd4-8486-8e73d79590fe",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "**Node ID:** node_38<br>**Similarity:** 0.814377909267451<br>**Text:** I also worked on spam filters, and did some more painting. I used to have dinners for a group of friends every thursday night, which taught me how to cook for groups. And I bought another building in Cambridge, a former candy factory (and later, twas said, porn studio), to use as an office.\n",
       "\n",
       "One night in October 2003 there was a big party at my house. It was a clever idea of my friend Maria Daniels, who was one of the thursday diners. Three separate hosts would all invite their friends to one party. So for every guest, two thirds of the other guests would be people they didn't know but would probably like. One of the guests was someone I didn't know but would turn out to like a lot: a woman called Jessica Livingston. A couple days later I asked her out.\n",
       "\n",
       "Jessica was in charge of marketing at a Boston investment bank. This bank thought it understood startups, but over the next year, as she met friends of mine from the startup world, she was surprised how different reality was. And ho...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/markdown": [
       "**Node ID:** node_0<br>**Similarity:** 0.8122448657654567<br>**Text:** What I Worked On\n",
       "\n",
       "February 2021\n",
       "\n",
       "Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n",
       "\n",
       "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called \"data processing.\" This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n",
       "\n",
       "The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in ...<br>"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from llama_index.core.response.notebook_utils import display_source_node\n",
    "\n",
    "for node in retrieved_nodes:\n",
    "    display_source_node(node, source_length=1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5371db56-2b1c-497a-8fd0-a1a69b2ce773",
   "metadata": {},
   "source": [
    "## Build an Evaluation dataset of (query, context) pairs\n",
    "\n",
    "Here we build a simple evaluation dataset over the existing text corpus.\n",
    "\n",
    "We use our `generate_question_context_pairs` to generate a set of (question, context) pairs over a given unstructured text corpus. This uses the LLM to auto-generate questions from each context chunk.\n",
    "\n",
    "We get back a `EmbeddingQAFinetuneDataset` object. At a high-level this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a25924cf-7eeb-4160-a035-4a69ee1e46de",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.evaluation import (\n",
    "    generate_question_context_pairs,\n",
    "    EmbeddingQAFinetuneDataset,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d29a159-9a4f-4d44-9c0d-1cd683f8bb9b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 61/61 [06:10<00:00,  6.08s/it]\n"
     ]
    }
   ],
   "source": [
    "qa_dataset = generate_question_context_pairs(\n",
    "    nodes, llm=llm, num_questions_per_chunk=2\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f32d458b-50ad-426c-a969-e9fe8fb5861a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"Describe the transition from using the IBM 1401 to microcomputers, as mentioned in the text. What were the key differences and how did these changes impact the user's interaction with the computer?\"\n"
     ]
    }
   ],
   "source": [
    "queries = qa_dataset.queries.values()\n",
    "print(list(queries)[2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a900650-38ed-405e-936c-08e48e0fb8ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "# [optional] save\n",
    "qa_dataset.save_json(\"pg_eval_dataset.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "713c1b71-2ab6-42a0-bde3-ab3bfe880f97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# [optional] load\n",
    "qa_dataset = EmbeddingQAFinetuneDataset.from_json(\"pg_eval_dataset.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3267fd6-187c-4e11-9f80-cfce08d98f1f",
   "metadata": {},
   "source": [
    "## Use `RetrieverEvaluator` for Retrieval Evaluation\n",
    "\n",
    "We're now ready to run our retrieval evals. We'll run our `RetrieverEvaluator` over the eval dataset that we generated."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16e56d17-2aa9-4e83-94ee-702088664b3c",
   "metadata": {},
   "source": [
    "We define two functions: `get_eval_results` and also `display_results` that run our retriever over the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17b870a4-095c-47b9-95b9-ae44cc0c014a",
   "metadata": {},
   "outputs": [],
   "source": [
    "include_cohere_rerank = False\n",
    "\n",
    "if include_cohere_rerank:\n",
    "    !pip install cohere -q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42fa6f9c-a6da-43c4-b4c6-748ada393490",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.evaluation import RetrieverEvaluator\n",
    "\n",
    "metrics = [\"hit_rate\", \"mrr\", \"precision\", \"recall\", \"ap\", \"ndcg\"]\n",
    "\n",
    "if include_cohere_rerank:\n",
    "    metrics.append(\n",
    "        \"cohere_rerank_relevancy\"  # requires COHERE_API_KEY environment variable to be set\n",
    "    )\n",
    "\n",
    "retriever_evaluator = RetrieverEvaluator.from_metric_names(\n",
    "    metrics, retriever=retriever\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a16f3351-d745-46b3-b53b-916f7c244def",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Query: Describe the author's initial experiences with programming on the IBM 1401. What challenges did he face and how did these experiences shape his understanding of programming?\n",
      "Metrics: {'hit_rate': 1.0, 'mrr': 1.0, 'precision': 0.5, 'recall': 1.0, 'ap': 1.0, 'ndcg': 0.6131471927654584}\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# try it out on a sample query\n",
    "sample_id, sample_query = list(qa_dataset.queries.items())[0]\n",
    "sample_expected = qa_dataset.relevant_docs[sample_id]\n",
    "\n",
    "eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)\n",
    "print(eval_result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3963d146-c4a3-4b00-8b53-52c6ec03d862",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Retrying llama_index.embeddings.openai.base.aget_embedding in 0.6914689476274432 seconds as it raised RateLimitError: Error code: 429 - {'statusCode': 429, 'message': 'Rate limit is exceeded. Try again in 3 seconds.'}.\n",
      "Retrying llama_index.embeddings.openai.base.aget_embedding in 1.072244476250501 seconds as it raised RateLimitError: Error code: 429 - {'statusCode': 429, 'message': 'Rate limit is exceeded. Try again in 3 seconds.'}.\n",
      "Retrying llama_index.embeddings.openai.base.aget_embedding in 0.8123380504307198 seconds as it raised RateLimitError: Error code: 429 - {'statusCode': 429, 'message': 'Rate limit is exceeded. Try again in 4 seconds.'}.\n",
      "Retrying llama_index.embeddings.openai.base.aget_embedding in 0.9520260756712478 seconds as it raised RateLimitError: Error code: 429 - {'statusCode': 429, 'message': 'Rate limit is exceeded. Try again in 6 seconds.'}.\n",
      "Retrying llama_index.embeddings.openai.base.aget_embedding in 1.3700745779005286 seconds as it raised RateLimitError: Error code: 429 - {'statusCode': 429, 'message': 'Rate limit is exceeded. Try again in 4 seconds.'}.\n"
     ]
    }
   ],
   "source": [
    "# try it out on an entire dataset\n",
    "eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddbc38c7-660e-451b-8305-e8a7f23510a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def display_results(name, eval_results):\n",
    "    \"\"\"Display results from evaluate.\"\"\"\n",
    "\n",
    "    metric_dicts = []\n",
    "    for eval_result in eval_results:\n",
    "        metric_dict = eval_result.metric_vals_dict\n",
    "        metric_dicts.append(metric_dict)\n",
    "\n",
    "    full_df = pd.DataFrame(metric_dicts)\n",
    "\n",
    "    columns = {\n",
    "        \"retrievers\": [name],\n",
    "        **{k: [full_df[k].mean()] for k in metrics},\n",
    "    }\n",
    "\n",
    "    if include_cohere_rerank:\n",
    "        crr_relevancy = full_df[\"cohere_rerank_relevancy\"].mean()\n",
    "        columns.update({\"cohere_rerank_relevancy\": [crr_relevancy]})\n",
    "\n",
    "    metric_df = pd.DataFrame(columns)\n",
    "\n",
    "    return metric_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d059d5ee-d2aa-4edf-8b9b-e09d29ff17b2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>retrievers</th>\n",
       "      <th>hit_rate</th>\n",
       "      <th>mrr</th>\n",
       "      <th>precision</th>\n",
       "      <th>recall</th>\n",
       "      <th>ap</th>\n",
       "      <th>ndcg</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>top-2 eval</td>\n",
       "      <td>0.770492</td>\n",
       "      <td>0.655738</td>\n",
       "      <td>0.385246</td>\n",
       "      <td>0.770492</td>\n",
       "      <td>0.655738</td>\n",
       "      <td>0.420488</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   retrievers  hit_rate       mrr  precision    recall        ap      ndcg\n",
       "0  top-2 eval  0.770492  0.655738   0.385246  0.770492  0.655738  0.420488"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "display_results(\"top-2 eval\", eval_results)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
